feat(job-distributor): add exp. backoff retry to `feeds.SyncNodeInfo()` #15752

gustavogama-cll · 2024-12-18T02:13:49Z

For some time we've observed a behavior on the NOP side where NOPs add or update a chain configuration in the Job Distributor panel but the change is not reflected on the service itself. This leads to inefficiencies: NOPs are unaware that the update was lost and need to be notified so that they can "reapply" the configuration.

After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which causes the message with the update to be lost.

This PR attempts to solve this by adding a "retry" wrapper on top of the existing `SyncNodeInfo` method. We rely on `avast/retry-go` to implement the bulk of the retry logic. It's configured with a minimum delay of 10 seconds and a maximum delay of 30 minutes, and it retries a total of 56 times, which adds up to a bit more than 24 hours.
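As a rough check of the "bit more than 24 hours" figure, the sketch below sums the waits of a plain doubling backoff with the parameters above. It assumes the delay starts at 10 seconds, doubles after each failure, is capped at 30 minutes, and that 56 attempts means 55 waits between them; jitter is ignored. `backoffDelay` and `totalWait` are illustrative names, not functions from this PR or from `avast/retry-go`.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry n (0-based): base doubled n
// times and capped at max. Illustration only, not the retry-go code.
func backoffDelay(n uint, base, max time.Duration) time.Duration {
	d := base
	// Stop doubling once the cap is reached so the value cannot overflow.
	for i := uint(0); i < n && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// totalWait sums the waits between `attempts` consecutive attempts,
// i.e. attempts-1 delays.
func totalWait(attempts uint, base, max time.Duration) time.Duration {
	var total time.Duration
	for n := uint(0); n+1 < attempts; n++ {
		total += backoffDelay(n, base, max)
	}
	return total
}

func main() {
	// 8 doubling waits (10s .. 1280s) plus 47 waits capped at 30 minutes.
	fmt.Println(totalWait(56, 10*time.Second, 30*time.Minute)) // prints 24h12m30s
}
```

Under these assumptions the wrapper keeps retrying for roughly a day before giving up, which matches the description.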

DPA-1371

core/config/toml/types.go

core/services/feeds/service.go

github-actions · 2024-12-18T02:40:56Z

AER Report: CI Core ran successfully ✅

AER Report: Operator UI CI ran successfully ✅

graham-chainlink · 2024-12-19T02:36:51Z

Hmm, I wonder whether this would solve the connection issue.

If there is a communication issue between the node and JD, how would the auto sync help resolve it? It will try and it will fail, right?

Alternatively, would it be better to have some kind of exponential backoff retry when the sync fails instead? (Not that it will solve a permanent connection issue.)

core/services/feeds/service.go

gustavogama-cll · 2024-12-20T05:56:14Z

> Alternatively would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (not that it will solve a permanent connection issue)

As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic around the context instances to get there.

I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.
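For readers following along, here is a minimal sketch of the retry-on-failure shape discussed in this thread, including the context handling that lets a pending retry be abandoned. It uses only the standard library rather than `avast/retry-go`, and `retryWithBackoff`, `flakySync` and `demo` are hypothetical names for illustration, not the code in this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls fn until it succeeds, attempts run out, or ctx is
// cancelled. The wait doubles after each failure and is capped at max.
func retryWithBackoff(ctx context.Context, attempts int, base, max time.Duration,
	fn func(context.Context) error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point in waiting after the final attempt
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done(): // caller gave up: stop waiting immediately
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		if delay *= 2; delay > max {
			delay = max
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// demo runs the wrapper against a stand-in sync call that fails twice and
// then succeeds, returning how many calls were made.
func demo() (int, error) {
	calls := 0
	flakySync := func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("connection lost")
		}
		return nil
	}
	err := retryWithBackoff(context.Background(), 5,
		time.Millisecond, 10*time.Millisecond, flakySync)
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err) // prints 3 <nil>
}
```

The `select` on `ctx.Done()` is what makes the wait interruptible, so a newer sync request or a shutdown does not have to sit out a 30-minute delay.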

graham-chainlink · 2024-12-21T05:51:25Z

> I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.

Thanks @gustavogama-cll. Yeah, the background goroutine definitely has its pros, and both approaches are valid; I just think the retry is simpler.

core/services/feeds/service.go

cl-sonarqube-production · 2025-01-15T03:46:13Z

Quality Gate passed

Issues
1 New issue
1 Fixed issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

github-actions · 2025-02-19T22:25:18Z

Flakeguard Summary

Ran new or updated tests between develop and 4ff90cc (dpa-1371-feat-periodic-sync-node-info-job-distributor).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

2 Results

| Name | Pass Ratio | Panicked? | Timed Out? | Race? | Runs | Successes | Failures | Skips | Package | Package Panicked? | Avg Duration | Code Owners |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Test_Service_syncNodeInfoWithRetry | 0% | false | false | false | 3 | 0 | 3 | 0 | github.com/smartcontractkit/chainlink/v2/core/services/feeds | false | 0s | @smartcontractkit/deployment-automation, @smartcontractkit/core |
| Test_Service_syncNodeInfoWithRetry/more_errors_than_MaxAttempts | 0% | false | false | false | 3 | 0 | 3 | 0 | github.com/smartcontractkit/chainlink/v2/core/services/feeds | false | 1.016666666s | @smartcontractkit/deployment-automation, @smartcontractkit/core |

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

github-actions · 2025-02-20T22:44:33Z

Flakeguard Summary

Ran new or updated tests between develop and 8fa7efc (dpa-1371-feat-periodic-sync-node-info-job-distributor).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

2 Results

| Name | Pass Ratio | Panicked? | Timed Out? | Race? | Runs | Successes | Failures | Skips | Package | Package Panicked? | Avg Duration | Code Owners |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Test_Service_syncNodeInfoWithRetry | 0% | false | false | false | 3 | 0 | 3 | 0 | github.com/smartcontractkit/chainlink/v2/core/services/feeds | false | 0s | @smartcontractkit/deployment-automation, @smartcontractkit/core |
| Test_Service_syncNodeInfoWithRetry/more_errors_than_MaxAttempts | 0% | false | false | false | 3 | 0 | 3 | 0 | github.com/smartcontractkit/chainlink/v2/core/services/feeds | false | 1.016666666s | @smartcontractkit/deployment-automation, @smartcontractkit/core |

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

graham-chainlink

looks good

core/services/feeds/service.go


cl-sonarqube-production · 2025-02-24T20:06:29Z

Quality Gate passed

Issues
0 New issues
1 Fixed issue
1 Accepted issue

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

gustavogama-cll commented Dec 18, 2024

core/config/toml/types.go (outdated, resolved)

gustavogama-cll commented Dec 18, 2024

core/services/feeds/service.go (resolved)

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 3 times, most recently from 301972c to 83b1842 Compare December 18, 2024 05:51

graham-chainlink reviewed Dec 19, 2024

core/services/feeds/service.go (outdated, resolved)

core/services/feeds/service.go (outdated, resolved)

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 2 times, most recently from 57b55cc to c5d0079 Compare December 20, 2024 04:34

gustavogama-cll changed the title ~~feat(job-distributor): periodically sync node info with job distributors~~ feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo() Dec 20, 2024

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 2 times, most recently from a1a4281 to 61297ab Compare December 20, 2024 05:17

gustavogama-cll requested a review from graham-chainlink December 20, 2024 18:34

graham-chainlink reviewed Dec 21, 2024

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 61297ab to 5c30694 Compare December 27, 2024 02:32

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 6 times, most recently from e540377 to b2f8386 Compare January 15, 2025 03:35

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from b2f8386 to 4ff90cc Compare February 19, 2025 22:16

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 4ff90cc to cd86fbb Compare February 19, 2025 22:53

gustavogama-cll marked this pull request as ready for review February 20, 2025 03:05

gustavogama-cll requested a review from a team as a code owner February 20, 2025 03:05

gustavogama-cll requested review from a team as code owners February 20, 2025 03:05

gustavogama-cll requested a review from jmank88 February 20, 2025 03:05

gustavogama-cll marked this pull request as draft February 20, 2025 22:34

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from cd86fbb to 8fa7efc Compare February 20, 2025 22:35

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 8fa7efc to 6abb8aa Compare February 21, 2025 05:02

gustavogama-cll marked this pull request as ready for review February 21, 2025 05:32

graham-chainlink reviewed Feb 24, 2025

core/services/feeds/service.go (resolved)

core/services/feeds/service.go (outdated, resolved)

gustavogama-cll added 4 commits February 24, 2025 16:49

review: protect cancel func access with a mutex to avoid race conditions

7d631a7

review: trigger retry on partial failures and support multiple job di…

d7e6aa9

…stributors

review: clear contexts before closing the connection manager

a42860e
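The review commits above mention guarding cancel functions with a mutex, supporting multiple job distributors, and clearing contexts before the connection manager is closed. One way such bookkeeping could look is sketched below; this is an assumption about the general pattern, not the code from these commits, and `cancelRegistry`, `start`, `closeAll` and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// cancelRegistry tracks one cancel func per job distributor ID so that a
// newer retry loop can replace an older one without racing on the map.
type cancelRegistry struct {
	mu      sync.Mutex
	cancels map[int64]context.CancelFunc
}

func newCancelRegistry() *cancelRegistry {
	return &cancelRegistry{cancels: make(map[int64]context.CancelFunc)}
}

// start cancels any retry loop already registered for id and returns a
// fresh context for the new one.
func (r *cancelRegistry) start(parent context.Context, id int64) context.Context {
	ctx, cancel := context.WithCancel(parent)
	r.mu.Lock()
	defer r.mu.Unlock()
	if prev, ok := r.cancels[id]; ok {
		prev()
	}
	r.cancels[id] = cancel
	return ctx
}

// closeAll cancels every tracked loop and empties the registry; it would be
// called before the underlying connections are shut down.
func (r *cancelRegistry) closeAll() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, cancel := range r.cancels {
		cancel()
		delete(r.cancels, id)
	}
}

// demo shows that a second start for the same ID cancels the first context,
// and that closeAll cancels whatever is still registered.
func demo() (firstCancelled, secondCancelled bool) {
	reg := newCancelRegistry()
	first := reg.start(context.Background(), 1)
	second := reg.start(context.Background(), 1) // supersedes first
	firstCancelled = first.Err() != nil
	reg.closeAll()
	secondCancelled = second.Err() != nil
	return
}

func main() {
	fmt.Println(demo()) // prints true true
}
```

Holding the mutex for every read and write of the map is what prevents a race when a new sync request and a shutdown arrive at the same time.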

gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 6abb8aa to a42860e Compare February 24, 2025 19:50

graham-chainlink approved these changes Feb 25, 2025

jkongie approved these changes Feb 26, 2025

gustavogama-cll added this pull request to the merge queue Feb 26, 2025

Merged via the queue into develop with commit 39d0909 Feb 26, 2025
167 checks passed

gustavogama-cll deleted the dpa-1371-feat-periodic-sync-node-info-job-distributor branch February 26, 2025 17:34

This was referenced Feb 26, 2025

[DO NOT MERGE] Changeset Release Preview - v2.21.0 #13148

Draft

[DO NOT MERGE] Changeset Release Preview - v2.21.0 mikeyhodl/chainlink#830

Draft
